Part 2: Intro Pandas

Preface

🐼 ?

  • Popular library for analyzing tabular data.
    • Tabular: rectangular, has rows and columns.
  • Expressive and full-featured.
    • Readable syntax
    • Lots of convenience functions

🐼: panel data, apparently.

Pros and Cons

Reasons to use

  • Support for many file types
  • Integrated into data analysis ecosystem
  • Balance of verbosity and function
  • Complex time-series and hierarchically indexed data functionality

Reasons not to use

  • Struggles with larger datasets (>1M rows)
  • Significant overhead
  • Memory-intensive
  • CPU-only

Running Scenario

  • Analyzing student scores and sending summary reports
  • Two datasets:
    • data/scores.csv: students’ scores in 5 subjects
    • data/contacts.csv: students’ contact details

Functionality Covered

  • How do I read in data from files?
  • How do I select and filter values?
  • How do I perform calculations on data?
  • How do I merge/combine data from multiple sources?

Importing Libraries

A brief aside on a first step that tends to get overlooked.

Libraries

  • Functionality of base Python is limited for data analysis workflows
  • Python language is extensible, benefits from enormous ecosystem
  • pandas is a library for data analysis

import

  • To use a library, we need to import it:
import pandas
  • In Jupyter, we can use %who to list objects in the global namespace
%who
ojs_define   pandas  
  • The functions and classes within pandas are now accessible as attributes of the imported module.
pandas.__version__
'2.0.3'

from

  • We can use the from statement to import objects and functions directly from modules into the global namespace.
from pandas import __version__
print(__version__)
2.0.3

as

  • The default approach in Python is to keep fewer objects in the global namespace.
  • This means always prefacing the function or class we want to use with the module name.
  • This can get verbose, so we often abbreviate library names with as:
import pandas as pd
print(pd)
print(pd.DataFrame)
<module 'pandas' from '/home/muhark/anaconda3/envs/ds3/lib/python3.11/site-packages/pandas/__init__.py'>
<class 'pandas.core.frame.DataFrame'>

Contrast with R

  • In R, the default approach is to import all methods from a library into the global namespace.
  • If we did this with pandas it would look like this:
from pandas import *
%who
ArrowDtype   BooleanDtype    Categorical     CategoricalDtype    CategoricalIndex    DataFrame   DateOffset  DatetimeIndex   DatetimeTZDtype     
ExcelFile    ExcelWriter     Flags   Float32Dtype    Float64Dtype    Grouper     HDFStore    Index   IndexSlice  
Int16Dtype   Int32Dtype  Int64Dtype  Int8Dtype   Interval    IntervalDtype   IntervalIndex   MultiIndex  NA  
NaT  NamedAgg    Period  PeriodDtype     PeriodIndex     RangeIndex  Series  SparseDtype     StringDtype     
Timedelta    TimedeltaIndex  Timestamp   UInt16Dtype     UInt32Dtype     UInt64Dtype     UInt8Dtype  api     array   
arrays   bdate_range     concat  crosstab    cut     date_range  describe_option     errors  eval    
factorize    from_dummies    get_dummies     get_option  infer_freq  interval_range  io  isna    isnull  
json_normalize   lreshape    melt    merge   merge_asof  merge_ordered   notna   notnull     offsets     
ojs_define   option_context  options     pandas  pd  period_range    pivot   pivot_table     plotting    
qcut     read_clipboard  read_csv    read_excel  read_feather    read_fwf    read_gbq    read_hdf    read_html   
read_json    read_orc    read_parquet    read_pickle     read_sas    read_spss   read_sql    read_sql_query  read_sql_table  
read_stata   read_table  read_xml    reset_option    set_eng_float_format    set_option  show_versions   test    testing     
timedelta_range  to_datetime     to_numeric  to_pickle   to_timedelta    tseries     unique  value_counts    wide_to_long    

Cleanup

NB: %-commands only work in Jupyter (IPython).

%reset -f
import pandas as pd # normal way of importing

Getting Your Data

Manual DataFrame

  • Can be constructed manually from a dict of equal-length lists.
scores_dict = {
    'student_id': ['5a01', '5a12', '5b05', '5b10', '5e04'],
    'math':    [95, 78, 85, 90, 88],
    'english': [97, 91, 86, 89, 90],
    'history': [80, 89, 94, 87, 85],
    'biology': [81, 86, 88, 99, 88],
    'art':     [86, 81, 82, 91, 84]
}

Manual DataFrame

df = pd.DataFrame(data=scores_dict)
df
student_id math english history biology art
0 5a01 95 97 80 81 86
1 5a12 78 91 89 86 81
2 5b05 85 86 94 88 82
3 5b10 90 89 87 99 91
4 5e04 88 90 85 88 84

Reading from a file

pandas comes with functions for reading and writing to all kinds of data formats. A quick list can be viewed using tab completion:

pd.read_<TAB>
read_clipboard() read_gbq()       read_parquet()   read_sql_query()
read_csv()       read_hdf()       read_pickle()    read_sql_table()
read_excel()     read_html()      read_sas()       read_stata()    
read_feather()   read_json()      read_spss()      read_table()    
read_fwf()       read_orc()       read_sql()       read_xml()      
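Each reader has a matching writer on the DataFrame side (.to_csv, .to_excel, …). A minimal round-trip sketch, using an in-memory buffer in place of a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'student_id': ['5a01', '5a12'],
                   'math': [95, 78]})

buf = io.StringIO()              # stand-in for a file on disk
df.to_csv(buf, index=False)      # writer counterpart of pd.read_csv
buf.seek(0)
roundtrip = pd.read_csv(buf)
print(roundtrip.equals(df))
```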

Data IO: read_csv

df = pd.read_csv("../data/scores.csv")
df
student_id math english history biology art
0 5a01 95 97 80 81 86
1 5a12 78 91 89 86 81
2 5b05 85 86 94 88 82
3 5b10 90 89 87 99 91
4 5e04 88 90 85 88 84

Selecting Data

  • “What were everyone’s math scores?”
  • “What did student 5a12 get on all subjects?”
  • “What did 5a12 and 5e04 get on history and art?”

Columns and Index

  • Entries indexed by columns and rows
student_id math english history biology art
0 5a01 95 97 80 81 86
1 5a12 78 91 89 86 81
2 5b05 85 86 94 88 82
3 5b10 90 89 87 99 91
4 5e04 88 90 85 88 84

df.columns and df.index

  • These can be accessed through the following attributes:
df.columns
Index(['student_id', 'math', 'english', 'history', 'biology', 'art'], dtype='object')
df.index
RangeIndex(start=0, stop=5, step=1)
  • By default dataframes have a numerical index.

Setting the Index

  • df.set_index() returns dataframe with new index
df = df.set_index('student_id')
df.index
Index(['5a01', '5a12', '5b05', '5b10', '5e04'], dtype='object', name='student_id')
math english history biology art
student_id
5a01 95 97 80 81 86
5a12 78 91 89 86 81
5b05 85 86 94 88 82
5b10 90 89 87 99 91
5e04 88 90 85 88 84
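The same result can be had at load time: pd.read_csv accepts an index_col argument, which sets the index while reading. A sketch with an in-memory CSV:

```python
import io
import pandas as pd

csv_text = "student_id,math\n5a01,95\n5a12,78\n"

# index_col='student_id' is equivalent to reading then .set_index('student_id')
df = pd.read_csv(io.StringIO(csv_text), index_col='student_id')
print(df.index.tolist())
```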

.loc

  • Use .loc for name-based indexing.
  • General syntax: .loc[<INDEX>, <COLS>]
    • <INDEX> and <COLS> correspond to the index and column names.
    • They can be a single value, a list, or : to indicate “all”.
  • Let’s learn by example:

Row

  • “What did 5a01 get on all (:) exams?”
df.loc['5a01', :]
math       95
english    97
history    80
biology    81
art        86
Name: 5a01, dtype: int64

Column

  • “What did everyone (:) get on history?”
df.loc[:, 'history']
student_id
5a01    80
5a12    89
5b05    94
5b10    87
5e04    85
Name: history, dtype: int64

Multiple Rows

  • “What did 5a01 and 5a12 (['5a01', '5a12']) get on all (:) exams?”
df.loc[['5a01', '5a12'], :]
math english history biology art
student_id
5a01 95 97 80 81 86
5a12 78 91 89 86 81

Multiple Columns

  • What did everyone (:) get on art and history (['art', 'history'])?
df.loc[:, ['art', 'history']]
art history
student_id
5a01 86 80
5a12 81 89
5b05 82 94
5b10 91 87
5e04 84 85

Label Order

Note the order of labels changes the order of columns:

df.loc[:, ['art', 'history']]
art history
student_id
5a01 86 80
5a12 81 89
5b05 82 94
5b10 91 87
5e04 84 85

Cell

  • What did 5a01 get in history?
df.loc['5a01', 'history']
80

Multiple Values

  • What did 5a12 get in history and art?
df.loc['5a12', ['history', 'art']]
history    89
art        81
Name: 5a12, dtype: int64

cont.

  • What did 5a01, 5a12, and 5b05 get in biology and art?
df.loc[['5a01', '5a12', '5b05'], ['biology', 'art']]
biology art
student_id
5a01 81 86
5a12 86 81
5b05 88 82

Filtering

“The overall scores of students who got 90 or higher in math”

  • Create a boolean array
df.loc[:, 'math'] >= 90
student_id
5a01     True
5a12    False
5b05    False
5b10     True
5e04    False
Name: math, dtype: bool

.loc Filtering

  • .loc can take boolean arrays:
good_math = df.loc[:, 'math'] >= 90
df.loc[good_math, :]
math english history biology art
student_id
5a01 95 97 80 81 86
5b10 90 89 87 99 91
  • Shorter syntax:
df.loc[df['math'].ge(90), :]
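Plain square brackets on a DataFrame also accept a boolean mask, which is the tersest (if least explicit) form. A sketch on the same scores:

```python
import pandas as pd

df = pd.DataFrame({'math': [95, 78, 85, 90, 88]},
                  index=pd.Index(['5a01', '5a12', '5b05', '5b10', '5e04'],
                                 name='student_id'))

# df[mask] keeps the rows where the mask is True
print(df[df['math'] >= 90].index.tolist())
```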

Combining Indexers and Filters

  • history scores where math and art \(\ge\) 85
df.loc[(df.loc[:, 'math']>=85) &
       (df.loc[:, 'art']>=85),
       'history']
student_id
5a01    80
5b10    87
Name: history, dtype: int64
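For compound conditions, df.query() offers a terser alternative: it evaluates a string expression over the column names. A sketch of the same filter:

```python
import pandas as pd

df = pd.DataFrame({
    'math':    [95, 78, 85, 90, 88],
    'art':     [86, 81, 82, 91, 84],
    'history': [80, 89, 94, 87, 85],
}, index=pd.Index(['5a01', '5a12', '5b05', '5b10', '5e04'],
                  name='student_id'))

# same filter as the nested .loc version, as one expression
result = df.query('math >= 85 and art >= 85')['history']
print(result.tolist())
```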

.iloc: locational indexing

  • First 2 rows (:2 in Python)
  • Last 3 columns (-3:)
df.iloc[:2, -3:]
history biology art
student_id
5a01 80 81 86
5a12 89 86 81
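One subtlety worth noting: .iloc slices follow Python's usual exclusive-stop convention, while .loc label slices include both endpoints. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'math': [95, 78, 85, 90, 88],
                   'english': [97, 91, 86, 89, 90]},
                  index=['5a01', '5a12', '5b05', '5b10', '5e04'])

print(df.iloc[:2].shape[0])            # positional: stop excluded, 2 rows
print(df.loc['5a01':'5b05'].shape[0])  # label slice: both ends included, 3 rows
```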

Operations

  • “Is this score over 90?”
  • “What was the average math score?”
  • “What was 5b10’s maximum score?”
  • “What was the lowest score in each exam?”

Some Terminology

Clarifying what I mean in this lecture when I say:

  • Scalar: a single value
  • Non-scalar: a data structure capable of containing multiple data points
  • Argument(s): the input(s) to a function

Series and DataFrame

  • pd.Series are 1-dimensional
  • pd.DataFrame are 2-dimensional

Series

  • Passing a scalar to an indexer (.loc) on a DataFrame returns a Series
df.loc[:, 'biology']
student_id
5a01    81
5a12    86
5b05    88
5b10    99
5e04    88
Name: biology, dtype: int64

DataFrame

  • Passing a list returns a DataFrame
df.loc[:, ['biology']]
biology
student_id
5a01 81
5a12 86
5b05 88
5b10 99
5e04 88

Arguments at Three Levels

  • Scalar: “Is this score over 90?”
def greater_than_90(x):
    return x > 90
  • Series: “What is the maximum score on one exam?”
def maximum(scores):
    return max(scores)
  • Dataframe: “How many individual scores?”
def how_many_elements(df):
    rows, cols = df.shape
    return rows*cols

Convenience Functions

  • Series and DataFrame objects have many common operations built-in:
    • \(\ne\) (.ne()), \(\gt\) (.gt()), …
    • mean (.mean()), median (.median()), standard deviation (.std()), …
  • These tend to be optimized.
  • See the documentation for a full list.

“What was the average math score?”

df.loc[:, 'math'].mean()
87.2
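Several such reductions can be computed in one call with .agg, which accepts a list of method names (or functions). A sketch on the math column:

```python
import pandas as pd

math = pd.Series([95, 78, 85, 90, 88],
                 index=['5a01', '5a12', '5b05', '5b10', '5e04'])

# .agg applies each named reduction and returns one result per name
summary = math.agg(['mean', 'min', 'max'])
print(summary.tolist())
```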

apply and applymap

  • Custom functions can be applied to Series and DataFrames using their .apply method.
  • pd.Series.apply: functions with scalar arguments
  • pd.DataFrame.apply: functions with pd.Series as an argument
    • Specify axis: with axis=0 the function receives each column; with axis=1, each row
  • pd.DataFrame.applymap: functions with scalar arguments

“What was the lowest score for each exam and student?”

  • Lowest per exam (axis=0)
df.min(axis=0)
math       78
english    86
history    80
biology    81
art        81
dtype: int64
  • Lowest per student (axis=1, using built-in min):
df.apply(min, axis=1)
student_id
5a01    80
5a12    78
5b05    82
5b10    87
5e04    84
dtype: int64
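For common reductions the optimized built-in method and .apply agree, and the built-in is the faster choice. A sketch of the equivalence:

```python
import pandas as pd

df = pd.DataFrame({'math': [95, 78], 'english': [97, 91],
                   'history': [80, 89]},
                  index=['5a01', '5a12'])

# built-in row-wise min equals applying Python's min to each row
print(df.min(axis=1).equals(df.apply(min, axis=1)))
```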

Convert Scores to A-F Ranking

def convert_score(x):
    score = ''
    if x >= 90:
        score = 'A'
    elif x >= 80:
        score = 'B'
    elif x >= 70:
        score = 'C'
    elif x >= 60:
        score = 'D'
    else:
        score = 'F'
    return score

Convert Scores to A-F Ranking

df.applymap(convert_score)
math english history biology art
student_id
5a01 A A B B B
5a12 C A B B B
5b05 B B A B B
5b10 A B B A A
5e04 B A B B B
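The same converter also works one column at a time via pd.Series.apply, which feeds each scalar element to the function:

```python
import pandas as pd

def convert_score(x):
    # same grading logic as the convert_score function above
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    return 'F'

math = pd.Series([95, 78, 85, 90, 88],
                 index=['5a01', '5a12', '5b05', '5b10', '5e04'])
print(math.apply(convert_score).tolist())
```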

Combining Data

  • How do we combine data from multiple sources?

Two Ways

  • Concatenating: sticking it together
  • Joining: merging on common “key”

We forgot two students!

df_extra = pd.read_csv('../data/scores_extra.csv',
                       index_col='student_id')
df_extra
math english history biology art
student_id
5b11 93 84 82 95 91
5c01 88 73 72 80 84

Concatenation

pd.concat([df, df_extra], axis=0)
math english history biology art
student_id
5a01 95 97 80 81 86
5a12 78 91 89 86 81
5b05 85 86 94 88 82
5b10 90 89 87 99 91
5e04 88 90 85 88 84
5b11 93 84 82 95 91
5c01 88 73 72 80 84

Different Columns

  • .reset_index() moves the index into a column:
df_extra.reset_index()
student_id math english history biology art
0 5b11 93 84 82 95 91
1 5c01 88 73 72 80 84

Different Columns

  • Concatenating with different columns creates NA values:
pd.concat([df_extra, df_extra.reset_index()])
math english history biology art student_id
5b11 93 84 82 95 91 NaN
5c01 88 73 72 80 84 NaN
0 93 84 82 95 91 5b11
1 88 73 72 80 84 5c01

Contact Details

  • We want to match scores to students’ first names
pd.read_csv('../data/contacts.csv')
StudentID FirstName LastName Age Gender Scholarship Email
0 5a01 Alice Smith 20 Female True alice@gmail.com
1 5a02 Bob Higgins 21 Male True bob@hotmail.com
2 5b05 Charlie Wylie 22 Male False charlie@yahoo.com
3 5b10 David Card 20 Male False david@gmail.com
4 5e04 Eva Longman 23 Female False eva@outlook.com
5 5b11 Frankie Krueger 20 Female True frankie@outlook.com
6 6a01 Gerald Nivea 19 Male False gerald@gmail.com

Preparing data for joining

  • The two dataframes have one column in common: the student ID.
    • In one it’s StudentID, in the other student_id.
  • For visual simplicity, I move student_id back to the columns
df_contact = pd.read_csv('../data/contacts.csv')
df_scores = df.reset_index()

Key

  • Note that the two dataframes do not have the exact same values of student ID!
df_contact['StudentID']
0    5a01
1    5a02
2    5b05
3    5b10
4    5e04
5    5b11
6    6a01
Name: StudentID, dtype: object
df_scores['student_id']
0    5a01
1    5a12
2    5b05
3    5b10
4    5e04
Name: student_id, dtype: object

Merge Syntax

pd.merge(
    how=<inner/left/right/outer>,
    left=df_contact[['FirstName', 'StudentID']],
    right=df_scores[['student_id', 'history']],
    left_on='StudentID',
    right_on='student_id')
  • We specify the two dataframes as left and right
  • We specify the common key for each dataframe
    • Note: if the key column had the same name in both, we could pass a single on argument
  • how is easiest to explain visually

how='inner'

FirstName StudentID student_id history
0 Alice 5a01 5a01 80
1 Charlie 5b05 5b05 94
2 David 5b10 5b10 87
3 Eva 5e04 5e04 85

how='left'

FirstName StudentID student_id history
0 Alice 5a01 5a01 80.0
1 Bob 5a02 NaN NaN
2 Charlie 5b05 5b05 94.0
3 David 5b10 5b10 87.0
4 Eva 5e04 5e04 85.0
5 Frankie 5b11 NaN NaN
6 Gerald 6a01 NaN NaN

how='right'

FirstName StudentID student_id history
0 Alice 5a01 5a01 80
1 NaN NaN 5a12 89
2 Charlie 5b05 5b05 94
3 David 5b10 5b10 87
4 Eva 5e04 5e04 85

how='outer'

FirstName StudentID student_id history
0 Alice 5a01 5a01 80.0
1 Bob 5a02 NaN NaN
2 Charlie 5b05 5b05 94.0
3 David 5b10 5b10 87.0
4 Eva 5e04 5e04 85.0
5 Frankie 5b11 NaN NaN
6 Gerald 6a01 NaN NaN
7 NaN NaN 5a12 89.0
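When debugging a merge, the indicator=True argument adds a _merge column recording, for each row, which side(s) it came from. A sketch with trimmed versions of the two frames:

```python
import pandas as pd

df_contact = pd.DataFrame({'StudentID': ['5a01', '5a02'],
                           'FirstName': ['Alice', 'Bob']})
df_scores = pd.DataFrame({'student_id': ['5a01', '5a12'],
                          'history': [80, 89]})

merged = pd.merge(left=df_contact, right=df_scores, how='outer',
                  left_on='StudentID', right_on='student_id',
                  indicator=True)
# _merge is 'both', 'left_only', or 'right_only' for each row
print(sorted(merged['_merge'].tolist()))
```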

Exploratory Analysis

British Election Study Data

link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data.csv'
bes_df = pd.read_csv(link)

First-Look Functions

  • When working with data, your first step should always be getting to know the data
    • Manually inspect samples of the data.
    • Check dimensions: are they expected?
    • Check data types: are they expected?
    • Tabulate variables: what are the levels?

Inspect first 5 rows

bes_df.head()
voted tory_vote election_interest civic_duty party_id ideo_lr ideo_pc1 ideo_pc2 vote_leave class female edlevel region
0 1 0 1 1 Labour 3 -1.204655 1.158077 1 2 1 0 East Midlands
1 1 0 1 5 Labour 0 3.077969 1.322138 0 2 0 1 East Midlands
2 0 0 2 2 Refused 5 -0.634762 -4.671532 1 2 0 0 East Midlands
3 1 0 1 4 Labour 6 3.686122 1.205449 1 2 1 0 East Midlands
4 1 0 1 4 Labour 2 0.139499 1.408508 0 2 1 4 East Midlands

What are the dimensions of the dataset?

bes_df.shape
(2194, 13)

What data types are each of the columns?

bes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2194 entries, 0 to 2193
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   voted              2194 non-null   int64  
 1   tory_vote          2194 non-null   int64  
 2   election_interest  2194 non-null   int64  
 3   civic_duty         2194 non-null   int64  
 4   party_id           2194 non-null   object 
 5   ideo_lr            2194 non-null   int64  
 6   ideo_pc1           2194 non-null   float64
 7   ideo_pc2           2194 non-null   float64
 8   vote_leave         2194 non-null   int64  
 9   class              2194 non-null   int64  
 10  female             2194 non-null   int64  
 11  edlevel            2194 non-null   int64  
 12  region             2194 non-null   object 
dtypes: float64(2), int64(9), object(2)
memory usage: 223.0+ KB

What unique values do each column contain?

  • pd.Series.value_counts() for tabulation
bes_df['female'].value_counts()
female
1    1193
0    1001
Name: count, dtype: int64
bes_df['region'].value_counts()
region
North West            304
South East            282
West Midlands         227
Eastern               226
London                212
Scotland              191
Yorkshire & Humber    187
South West            167
East Midlands         156
Wales                 129
North East            113
Name: count, dtype: int64
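value_counts can also report proportions via normalize=True, often more readable for a quick tabulation:

```python
import pandas as pd

female = pd.Series([1, 1, 0, 1, 0], name='female')

# normalize=True divides each count by the total number of observations
shares = female.value_counts(normalize=True)
print(shares.loc[1], shares.loc[0])
```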

Recap

pandas

  • File I/O (reading/writing data formats)
  • Indexing, slicing, filtering
  • Operations on data
  • Combining/merging data
  • Exploratory look

Additional Resources

Textbook

The following sections of Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython, 2nd edition are relevant to this lecture:

  • 5.*: Getting Started with pandas
  • 6.*: Data Loading, Storage and File Formats
  • 7.1-2: Data Cleaning and Preparation
  • 7.3: String Manipulation
  • 12.1: Categorical Data

Blogs, Docs